Professor Gaston has introduced us to two important ways of working with data, data manipulation and data visualization, and we have already learned two basic and functional packages for them, dplyr and ggplot2. However, both have limitations when we approach more sophisticated data and want more interactive graphs. Fortunately, two packages available on CRAN can help: reshape2 and plotly.
reshape2 provides a way to transform wide data structures into long form, which most functions can analyze more easily.
plotly introduces a reader-friendly, interactive way of plotting. Even when the plots are embedded in a file, readers can still explore them and try their own values in order to understand the functions.
(All the datasets I use as examples are built into R or ship with the packages used here. Feel free to copy and paste the code into RStudio and try the functions yourself! ^^)
# install `reshape2`
install.packages('reshape2')
# install `plotly`
# (install from CRAN)
install.packages("plotly")
# (install the latest development version (on GitHub) via devtools)
devtools::install_github("ropensci/plotly")
library('reshape2')
library('plotly')
reshape2
reshape2 is an R package written by Hadley Wickham that makes it easy to transform data between wide and long formats.
Wide data has a column for each variable. For example, this is wide-format data:
# airquality is a built-in R dataset.
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
And below is the same data in long format:
head(melt(airquality))
## variable value
## 1 Ozone 41
## 2 Ozone 36
## 3 Ozone 12
## 4 Ozone 18
## 5 Ozone NA
## 6 Ozone 28
Long-format data has one column for the possible variable types and one column for the values of those variables. However, long-format data can have different levels of “longness”. That is, long-format data does not necessarily have only two columns. For example, we could keep identifier columns so that there is an ozone measurement for each day of each month (the detailed instructions will be in the Discussion section).
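To make the idea of “longness” concrete, here is a minimal sketch that compares two melts of the same data. It reads a fresh copy through datasets::airquality so that other examples are unaffected, and the object names (aq, fully_long, partly_long) are only illustrative.

```r
library(reshape2)

# Work on a fresh copy of the built-in data (illustrative name)
aq <- datasets::airquality

# Fully long: every column becomes a (variable, value) pair
fully_long <- suppressMessages(melt(aq))

# Partly long: Month and Day stay as identifier columns
partly_long <- melt(aq, id.vars = c("Month", "Day"))

dim(fully_long)   # 153 rows x 6 measured columns = 918 rows, 2 columns
dim(partly_long)  # 153 rows x 4 measured columns = 612 rows, 4 columns
```

Both results are long-format data; they differ only in how many columns are kept as identifiers.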
When we say tidy data, in most situations we mean long-format data. Long format is the layout that many analysis tools expect. For example, ggplot2 and plyr generally expect long-format data, and most modelling functions (such as lm(), glm(), and gam()) require long-format data. That is why we need reshape2 to help transform data formats.
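As a hedged illustration of why long format is convenient for ggplot2, the sketch below melts the data and lets the single variable column drive the faceting. The object names and the choice of geom_line() with facet_wrap() are my own, not part of the original examples.

```r
library(reshape2)
library(ggplot2)

aq_long <- melt(datasets::airquality, id.vars = c("Month", "Day"))

# One row per measurement, so `variable` can be used like any other column
long_plot <- ggplot(aq_long, aes(x = Day, y = value, colour = factor(Month))) +
  geom_line(na.rm = TRUE) +
  facet_wrap(~ variable, scales = "free_y") +
  labs(colour = "Month")

long_plot
```

With the wide layout, each of the four measurements would need its own layer or its own plot.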
plotly
plotly is an R package for creating interactive web graphics via the open source JavaScript graphing library plotly.js.
reshape2 package
There are two key functions in reshape2: melt() and cast (available as dcast() and acast()).
melt() – wide- to long-format data
melt() takes wide-format data and transforms it into long-format data. By default, melt() assumes that all columns with numeric values are measured variables. Often this is what we want. But sometimes we want to know the values of ozone, solar.r, wind, and temp for each month and day. melt() has arguments that allow this.
First, let me introduce the “ID variables” argument, id.vars. ID variables are the variables that identify individual rows of data.
names(airquality) <- tolower(names(airquality))
aql <- melt(airquality, id.vars = c("month", "day"))
head(aql)
## month day variable value
## 1 5 1 ozone 41
## 2 5 2 ozone 36
## 3 5 3 ozone 12
## 4 5 4 ozone 18
## 5 5 5 ozone NA
## 6 5 6 ozone 28
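melt() also accepts a measure.vars argument to choose which columns get stacked, and na.rm to drop missing values. The following is a small sketch on a fresh copy of the data, so it uses the original capitalized column names; the object names are illustrative.

```r
library(reshape2)

aq <- datasets::airquality

# Stack only Ozone and Temp, keep Month and Day as IDs, and drop NA values
aq_subset <- melt(aq,
                  id.vars = c("Month", "Day"),
                  measure.vars = c("Ozone", "Temp"),
                  na.rm = TRUE)

head(aq_subset)
table(aq_subset$variable)  # fewer Ozone rows than Temp rows, since Ozone has NAs
```

Columns that are named in neither id.vars nor measure.vars (here Solar.R and Wind) are left out of the result.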
Also, we can control the column names of the long-format data via melt():
aql <- melt(airquality, id.vars = c("month", "day"),
            variable.name = "climate_variable",
            value.name = "climate_value")
head(aql)
## month day climate_variable climate_value
## 1 5 1 ozone 41
## 2 5 2 ozone 36
## 3 5 3 ozone 12
## 4 5 4 ozone 18
## 5 5 5 ozone NA
## 6 5 6 ozone 28
cast – long- to wide-format data
cast comes in several versions for different transformation needs. For example, we can use dcast() to get a data.frame, and acast() to get a vector, a matrix, or an array.
In the following example, we use dcast() to recover the wide layout from the aql data we got before via melt().
aql <- melt(airquality, id.vars = c("month", "day"))
aqw <- dcast(aql, month + day ~ variable)
head(aqw)
## month day ozone solar.r wind temp
## 1 5 1 41 190 7.4 67
## 2 5 2 36 118 8.0 72
## 3 5 3 12 149 12.6 74
## 4 5 4 18 313 11.5 62
## 5 5 5 NA NA 14.3 56
## 6 5 6 28 NA 14.9 66
Note that in month + day ~ variable, the left-hand side says that month and day are the ID variables and that we want a column for each; the right-hand side tells dcast() that variable holds the names of the measured variables, which become the new columns.
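The same formula interface covers two other common cases. This sketch, which assumes a fresh copy of the data with its original capitalized names, shows dcast() collapsing days into monthly means through its fun.aggregate argument, and acast() returning a three-dimensional array when the formula has three parts. Object names are illustrative.

```r
library(reshape2)

aq_long <- melt(datasets::airquality, id.vars = c("Month", "Day"))

# Month alone does not identify a row, so an aggregation function is needed;
# na.rm = TRUE is passed on to mean()
monthly_means <- dcast(aq_long, Month ~ variable,
                       fun.aggregate = mean, na.rm = TRUE)
monthly_means

# Three formula parts give a Day x Month x variable array
aq_array <- acast(aq_long, Day ~ Month ~ variable)
dim(aq_array)
```

Day and month combinations that do not exist in the data (for example day 31 in June) appear as NA in the array.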
Next, we compare aqw with the original airquality data.
head(airquality)
## ozone solar.r wind temp month day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
The only difference is that the two datasets have their columns in a different order. Sean C. Anderson has created a figure that illustrates the internal transformation of the data sets. In that figure, blue shading indicates the ID variables that we want to identify individual rows, red shading indicates the variable names that we want to swing into column names, and grey shading indicates the data values that we want to fill the cells with.
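The claim that only the column order differs can be checked directly. Here is a minimal sketch that repeats the round trip on a fresh copy of the data (so with the original capitalized names) and compares the values column by column; the helper logic and names are my own.

```r
library(reshape2)

aq <- datasets::airquality
aq_long <- melt(aq, id.vars = c("Month", "Day"))
aq_wide <- dcast(aq_long, Month + Day ~ variable)

# Put the columns back in the original order before comparing
aq_restored <- aq_wide[, names(aq)]

# Compare as numbers, because melt() stores the mixed integer and
# double measurements together as doubles
same_values <- all(mapply(
  function(x, y) isTRUE(all.equal(as.numeric(x), as.numeric(y))),
  aq_restored, aq
))
same_values
```

The comparison relies on airquality already being sorted by month and day, which is the order dcast() returns.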
plotly package
First, let us take a look at a function that makes interactive plots from our old friend, ggplot2. ggplotly(), a function in the plotly package, converts your ggplot2 plots to an interactive, web-based version! It also provides sensible tooltips, which help readers decode the values encoded as visual properties in the plot.
library(ggplot2)
# `faithful` is a built-in R dataset
# after_stat(level) replaces the deprecated ..level.. notation (needs ggplot2 >= 3.3.0)
webplot <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +
  xlim(1, 6) + ylim(40, 100)
ggplotly(webplot)
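The tooltips can also be customized. As a hedged sketch, the example below maps a label to the text aesthetic and asks ggplotly() to show only that label through its tooltip argument. The data copy car_data and the model column are illustrative additions.

```r
library(ggplot2)
library(plotly)

# Fresh copy of the built-in data with the car names as a column
car_data <- datasets::mtcars
car_data$model <- rownames(car_data)

# `text` is not drawn by geom_point(); ggplotly() picks it up for the tooltip
scatter <- ggplot(car_data, aes(x = wt, y = mpg, text = model)) +
  geom_point()

interactive_scatter <- ggplotly(scatter, tooltip = "text")
interactive_scatter
```

Hovering over a point then shows the car model instead of the default list of all mapped values.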
Then let us see the plotly version of interactive plots. By default, plotly for R runs locally in your web browser or in the RStudio viewer.
# `midwest` is a dataset that ships with ggplot2
p <- plot_ly(midwest, x = ~percollege, color = ~state, type = "box")
p
plotly.js supports some chart types that ggplot2 doesn’t (refer to the cheatsheet for more chart types). We can create any of these charts via plot_ly().
# `volcano` is a built-in R dataset.
plot_ly(z = ~volcano, type = "surface")
plotly supports faceting. However, compared to ggplot2, plotly cannot handle more than 9 facets.
# `diamonds` is a dataset that ships with ggplot2
library(plotly)
set.seed(100)
d <- diamonds[sample(nrow(diamonds), 1000), ]
plot_ly(d, x = ~carat, y = ~price, color = ~carat, size = ~carat,
        text = ~paste("Clarity: ", clarity))
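The scatter plot above does not itself use facets. One hedged way to get facet-like panels in plotly is to build one plot per group and combine them with subplot(); the grouping by cyl and all object names here are illustrative choices.

```r
library(plotly)

car_data <- datasets::mtcars

# One scatter plot per cylinder count (4, 6, 8)
panels <- lapply(split(car_data, car_data$cyl), function(df) {
  plot_ly(df, x = ~wt, y = ~mpg, type = "scatter", mode = "markers",
          name = paste(df$cyl[1], "cylinders"))
})

# Arrange the panels side by side with a shared y-axis
faceted <- subplot(panels, nrows = 1, shareY = TRUE)
faceted
```

Each panel is an ordinary plotly object, so it can be styled on its own before being combined.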
plotly has super fantastic graphing tools for 3D plots. Let us play with the mtcars dataset.
mtcars$am[which(mtcars$am == 0)] <- 'Automatic'
mtcars$am[which(mtcars$am == 1)] <- 'Manual'
mtcars$am <- as.factor(mtcars$am)
p <- plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec, color = ~am, colors = c('#BF382A', '#0C4B8E')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Weight'),
                      yaxis = list(title = 'Gross horsepower'),
                      zaxis = list(title = '1/4 mile time')))
p
We can also change the color scale and map it to a variable:
p <- plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec,
             marker = list(color = ~mpg, colorscale = c('#FFE1A1', '#683531'), showscale = TRUE)) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Weight'),
                      yaxis = list(title = 'Gross horsepower'),
                      zaxis = list(title = '1/4 mile time')),
         annotations = list(
           x = 1.13,
           y = 1.05,
           text = 'Miles/(US) gallon',
           xref = 'paper',
           yref = 'paper',
           showarrow = FALSE
         ))
p
ggplot2 and plotly can give us beautiful animated graphs. First, let us see how ggplotly() works:
# `gapminder` is a dataset in the `gapminder` package (make sure to install the `gapminder` package first)
data(gapminder, package = "gapminder")
gg <- ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent)) +
  geom_point(aes(size = pop, frame = year, ids = country)) +
  scale_x_log10()
ggplotly(gg)
Then let us see how plotly deals with key frame animation.
# `gapminder` is a dataset in the `gapminder` package (make sure to install the `gapminder` package first)
base <- gapminder %>%
  plot_ly(x = ~gdpPercap, y = ~lifeExp, size = ~pop,
          text = ~country, hoverinfo = "text") %>%
  layout(xaxis = list(type = "log"))
base %>%
  add_markers(color = ~continent, frame = ~year, ids = ~country) %>%
  animation_opts(1000, easing = "elastic", redraw = FALSE) %>%
  animation_button(
    x = 1, xanchor = "right", y = 0, yanchor = "bottom"
  ) %>%
  animation_slider(
    currentvalue = list(prefix = "YEAR ", font = list(color = "red"))
  )
Notice that plotly adds a “play” button and a slider to animated plots by default. Because plotly has more powerful commands for interactive, web-based plots, we can use animation_opts(), animation_button(), animation_slider() and others to customize how the animation runs and where its controls appear.
Long-format data makes it easier for ggplot2 and other packages to deal with data. In other words, long format is the layout that most analysis tools expect.
reshape2 provides two key functions to transform data formats: melt() and cast.
Besides ggplot2, there is another powerful graphing package called plotly. plotly provides an interactive and web-based method for data visualization.